Supporting Human-AI Collaboration in Auditing LLMs with LLMs
Large language models are becoming increasingly pervasive in society through
deployment in sociotechnical systems. Yet these models, whether used for
classification or generation, have been shown to be biased and to behave
irresponsibly, causing harm to people at scale. It is crucial to audit them
rigorously. Existing auditing tools leverage humans, AI, or both to find
failures. In this work, we draw upon literature in
human-AI collaboration and sensemaking, and conduct interviews with research
experts in safe and fair AI, to extend the auditing tool AdaTest (Ribeiro
and Lundberg, 2022), which is powered by a generative large language model
(LLM). Through the design process we highlight the importance of sensemaking
and human-AI communication to leverage complementary strengths of humans and
generative models in collaborative auditing. To evaluate the effectiveness of
the augmented tool, AdaTest++, we conduct user studies with participants
auditing two commercial language models: OpenAI's GPT-3 and Azure's sentiment
analysis model. Qualitative analysis shows that AdaTest++ effectively leverages
human strengths such as schematization, hypothesis formation and testing.
Further, with our tool, participants identified a variety of failure modes,
covering 26 topics across two tasks, including failures documented in prior
formal audits as well as previously under-reported ones. Comment: 21 pages, 3 figures
Aligning Offline Metrics and Human Judgments of Value for Code Generation Models
Large language models have demonstrated great potential to assist programmers
in generating code. For such human-AI pair programming scenarios, we
empirically demonstrate that while generated code is most often evaluated in
terms of functional correctness (i.e., whether generations pass available
unit tests), correctness does not fully capture (and may underestimate) the
productivity gains these models may provide. Through a user study with N = 49
experienced programmers, we show that while correctness captures high-value
generations, programmers still rate code that fails unit tests as valuable if
it reduces the overall effort needed to complete a coding task. Finally, we
propose a hybrid metric that combines functional correctness and syntactic
similarity and show that it achieves a 14% stronger correlation with value and
can therefore better represent real-world gains when evaluating and comparing
models. Comment: Accepted at ACL 2023 (Findings)
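The hybrid-metric idea above can be sketched as a weighted blend of a unit-test pass indicator and a syntactic-similarity score. This is a minimal illustration, not the paper's exact formulation: the `alpha` weight and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made here for concreteness.

```python
from difflib import SequenceMatcher

def syntactic_similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between two code strings."""
    return SequenceMatcher(None, generated, reference).ratio()

def hybrid_score(passes_tests: bool, generated: str, reference: str,
                 alpha: float = 0.5) -> float:
    """Blend functional correctness (0/1) with syntactic similarity.

    `alpha` controls the trade-off; 0.5 is an arbitrary choice here,
    not a value taken from the paper.
    """
    correctness = 1.0 if passes_tests else 0.0
    return alpha * correctness + (1.0 - alpha) * syntactic_similarity(
        generated, reference)

# A generation that fails its unit tests but closely matches a reference
# solution still earns partial credit, reflecting the paper's finding that
# such code can reduce a programmer's overall effort.
score = hybrid_score(False,
                     "def add(a, b): return a+b",
                     "def add(a, b):\n    return a + b")
```

Under this scheme a correct-and-identical generation scores 1.0, while a failing but near-identical one scores just under `1 - alpha`, capturing the "valuable despite failing tests" cases observed in the user study.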
Power to the People: The Role of Humans in Interactive Machine Learning
Systems that can learn interactively from their end-users are quickly becoming widespread. Until recently, this progress has been fueled mostly by advances in machine learning; however, more and more researchers are realizing the importance of studying users of these systems. In this article we promote this approach and demonstrate how it can result in better user experiences and more effective learning systems. We present a number of case studies that demonstrate how interactivity results in a tight coupling between the system and the user, exemplify ways in which some existing systems fail to account for the user, and explore new ways for learning systems to interact with their users. After giving a glimpse of the progress that has been made thus far, we discuss some of the challenges we face in moving the field forward. This is an author's peer-reviewed final manuscript, as accepted by the publisher. The published article is copyrighted by the American Association for Artificial Intelligence and can be found at: http://www.aaai.org/Magazine/magazine.php
ICE: Enabling Non-Experts to Build Models Interactively for Large-Scale Lopsided Problems
Quick interaction between a human teacher and a learning machine presents
numerous benefits and challenges when working with web-scale data. The human
teacher guides the machine towards accomplishing the task of interest. The
learning machine leverages big data to find examples that maximize the training
value of its interaction with the teacher. When the teacher is restricted to
labeling examples selected by the machine, this problem is an instance of
active learning. When the teacher can provide additional information to the
machine (e.g., suggestions on what examples or predictive features should be
used) as the learning task progresses, then the problem becomes one of
interactive learning.
To accommodate the two-way communication channel needed for efficient
interactive learning, the teacher and the machine need an environment that
supports an interaction language. The machine can access, process, and
summarize more examples than the teacher can see in a lifetime. Based on the
machine's output, the teacher can revise the definition of the task or make it
more precise. Both the teacher and the machine continuously learn and benefit
from the interaction.
We have built a platform to (1) produce valuable and deployable models and
(2) support research on both the machine learning and user interface challenges
of the interactive learning problem. The platform relies on a dedicated,
low-latency, distributed, in-memory architecture that allows us to construct
web-scale learning machines with quick interaction speed. The purpose of this
paper is to describe this architecture and demonstrate how it supports our
research efforts. Preliminary results are presented as illustrations of the
architecture but are not the primary focus of the paper.
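The active-learning setting described above, where the machine selects examples and the teacher labels them, can be sketched as an uncertainty-sampling loop. The model, the synthetic data, and the selection rule below are placeholder assumptions for illustration; they are not the ICE platform or its web-scale architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder "big data" pool: 2-D points whose label is the sign of x.
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool[:, 0] > 0).astype(int)

# Seed set the teacher has already labeled (a few examples of each class).
pos = np.where(y_pool == 1)[0][:5]
neg = np.where(y_pool == 0)[0][:5]
labeled = list(pos) + list(neg)
unlabeled = [i for i in range(500) if i not in labeled]

model = LogisticRegression()
for _ in range(20):  # 20 rounds of teacher-machine interaction
    model.fit(X_pool[labeled], y_pool[labeled])
    # The machine selects the pool example it is least certain about
    # (predicted probability closest to 0.5) ...
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    # ... and the teacher supplies its label.
    labeled.append(pick)
    unlabeled.remove(pick)
```

Interactive learning, as the abstract distinguishes it, would extend this loop with a richer channel: the teacher could also suggest features or reshape the task definition between rounds, rather than only answering label queries.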
Predicting Academic Success Based on Learning Material Usage
In this work, we explore students' usage of online learning material as a predictor of academic success. In the context of an introductory programming course, we recorded the amount of time that each element, such as a text paragraph or an image, was visible on the students' screen. Then, we applied machine learning methods to study to what extent material usage predicts course outcomes. Our results show that the time spent with each paragraph of the online learning material is a moderate predictor of student success even when corrected for student time-on-task, and that the information can be used to identify at-risk students. The predictive performance of the models is dependent on the quantity of data, and the predictions become more accurate as the course progresses. In a broader context, our results indicate that course material usage can be used to predict academic success, and that such data can be collected in situ with minimal interference to the students' learning process. Peer reviewed
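The prediction setup described above can be sketched as a classifier over per-paragraph viewing times, normalized by total time-on-task as the abstract suggests. The data here is synthetic and the feature and outcome construction are assumptions for illustration, not the study's actual log data or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

n_students, n_paragraphs = 200, 30
# Hypothetical log data: seconds each paragraph was visible per student.
times = rng.exponential(scale=60.0, size=(n_students, n_paragraphs))

# Correct for overall time-on-task by using each paragraph's *share*
# of a student's total viewing time as the feature.
features = times / times.sum(axis=1, keepdims=True)

# Synthetic outcome: students who spent an above-median share of time on
# the first five ("key") paragraphs pass the course.
key_share = features[:, :5].sum(axis=1)
passed = (key_share > np.median(key_share)).astype(int)

clf = LogisticRegression(max_iter=1000).fit(features, passed)
# Flag likely-failing students early, per the at-risk-identification use case.
at_risk = clf.predict_proba(features)[:, 0] > 0.7
```

In the real course setting the model would be retrained as the term progresses, which matches the abstract's observation that predictions become more accurate with more data.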
Trust in AutoML: Exploring Information Needs for Establishing Trust in Automated Machine Learning Systems
We explore trust in a relatively new area of data science: Automated Machine
Learning (AutoML). In AutoML, AI methods are used to generate and optimize
machine learning models by automatically engineering features, selecting
models, and optimizing hyperparameters. In this paper, we seek to understand
what kinds of information influence data scientists' trust in the models
produced by AutoML. We operationalize trust as a willingness to deploy a model
produced using automated methods. We report results from three studies --
qualitative interviews, a controlled experiment, and a card-sorting task -- to
understand the information needs of data scientists for establishing trust in
AutoML systems. We find that including transparency features in an AutoML tool
increased users' trust in, and understanding of, the tool; and that, of all the
proposed features, model performance metrics and visualizations are the most
important information for data scientists when establishing trust in an AutoML
tool. Comment: IUI 202
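The AutoML process the abstract describes, automatically selecting models and optimizing hyperparameters, can be sketched as a small random search that keeps the candidate with the best cross-validated score. This is a toy illustration of the general idea, not any specific AutoML system studied in the paper; the model families and search ranges are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Minimal AutoML-style search: sample a model family and hyperparameters
# at random, score each candidate by cross-validation, keep the best.
candidates = []
for _ in range(10):
    if rng.random() < 0.5:
        model = LogisticRegression(C=float(10 ** rng.uniform(-2, 2)),
                                   max_iter=1000)
    else:
        model = DecisionTreeClassifier(max_depth=int(rng.integers(1, 10)))
    score = cross_val_score(model, X, y, cv=3).mean()
    candidates.append((score, model))

best_score, best_model = max(candidates, key=lambda c: c[0])
```

The paper's finding suggests that surfacing exactly these per-candidate performance metrics (and visualizations of them), rather than only the final `best_model`, is what lets data scientists build trust in the tool's output.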
Researching AI Legibility Through Design
Everyday interactions with computers are increasingly likely to involve elements of Artificial Intelligence (AI). Encompassing a broad spectrum of technologies and applications, AI poses many challenges for HCI and design. One such challenge is the need to make AI’s role in a given system legible to the user in a meaningful way. In this paper we employ a Research through Design (RtD) approach to explore how this might be achieved. Building on contemporary concerns and a thorough exploration of related research, our RtD process reflects on designing imagery intended to help increase AI legibility for users. The paper makes three contributions. First, we thoroughly explore prior research in order to critically unpack the AI legibility problem space. Second, we respond with design proposals whose aim is to enhance the legibility, to users, of systems using AI. Third, we explore the role of design-led enquiry as a tool for critically exploring the intersection between HCI and AI research.
Emerging Perspectives in Human-Centered Machine Learning
Current Machine Learning (ML) models can make predictions that are as good as or better than those made by people. The rapid adoption of this technology puts it at the forefront of systems that impact the lives of many, yet the consequences of this adoption are not fully understood. Therefore, work at the intersection of people's needs and ML systems is more relevant than ever. This area of work, dubbed Human-Centered Machine Learning (HCML), re-thinks ML research and systems in terms of human goals. HCML gathers an interdisciplinary group of HCI and ML practitioners, each bringing their unique, yet related perspectives. This one-day workshop is a successor to Gillies et al. (2016) and focuses on recent advancements and emerging areas in HCML. We aim to discuss different perspectives on these areas and articulate a coordinated research agenda for the 21st century.
Human-Centered Machine Learning
Machine learning is one of the most important and successful techniques in contemporary computer science. It involves the statistical inference of models (such as classifiers) from data. It is often conceived in a very impersonal way, with algorithms working autonomously on passively collected data. However, this viewpoint hides considerable human work of tuning the algorithms, gathering the data, and even deciding what should be modeled in the first place. Examining machine learning from a human-centered perspective includes explicitly recognising this human work, as well as reframing machine learning workflows based on situated human working practices, and exploring the co-adaptation of humans and systems. A human-centered understanding of machine learning in human context can lead not only to more usable machine learning tools, but to new ways of framing learning computationally. This workshop will bring together researchers to discuss these issues and suggest future research questions aimed at creating a human-centered approach to machine learning.
Crowdsourcing the Perception of Machine Teaching
Teachable interfaces can empower end-users to attune machine learning systems
to their idiosyncratic characteristics and environment by explicitly providing
pertinent training examples. While facilitating control, their effectiveness
can be hindered by users' lack of expertise or by misconceptions. We investigate how
users may conceptualize, experience, and reflect on their engagement in machine
teaching by deploying a mobile teachable testbed in Amazon Mechanical Turk.
Using a performance-based payment scheme, Mechanical Turk workers (N = 100)
are asked to train, test, and re-train a robust recognition model in real time
with a few snapshots taken in their environment. We find that participants
incorporate diversity in their examples, drawing parallels to how humans
recognize objects regardless of size, viewpoint, location, and illumination.
Many of their misconceptions relate to consistency and to the model's
reasoning capabilities. With limited variation and few edge cases in testing,
the majority of them do not change strategies on a second training
attempt. Comment: 10 pages, 8 figures, 5 tables, CHI 2020 conference